Automated transcription system and method using two speech converting instances and computer-assisted correction

ABSTRACT

A system for automating transcription services for one or more users. This system receives a voice dictation file from a current user, which is automatically converted into a first written text based on a set of conversion variables. The same voice dictation file is automatically converted into a second written text based on a second set of conversion variables. The first and second sets of conversion variables have at least one difference, such as different speech recognition programs, different vocabularies, and the like. The system further includes a program for manually editing a copy of the first and second written text to create a verbatim text of the voice dictation file. This verbatim text can be delivered to the current user as transcribed text. The verbatim text can also be fed back into each speech recognition instance to improve the accuracy of each instance with respect to the human voice in the file.

This application claims the benefit of Provisional application Ser. No. 60/120,997, filed Feb. 19, 1999.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates in general to computer speech recognition systems and, in particular, to a system and method for automating the text transcription of voice dictation by various end users.

2. Background Art

Speech recognition programs are well known in the art. While these programs are ultimately useful in automatically converting speech into text, many users are dissuaded from using these programs because they require each user to spend a significant amount of time training the system. Usually this training begins by having each user read a series of pre-selected materials for approximately 20 minutes. Then, as the user continues to use the program, as words are improperly transcribed the user is expected to stop and train the program as to the intended word, thus advancing the ultimate accuracy of the acoustic model. Unfortunately, most professionals (doctors, dentists, veterinarians, lawyers) and business executives are unwilling to spend the time developing the necessary acoustic model to truly benefit from the automated transcription.

Accordingly, it is an object of the present invention to provide a system that offers transparent training of the speech recognition program to the end-users.

There are systems for using computers for routing transcription from a group of end users. Most often these systems are used in large multi-user settings such as hospitals. In those systems, a voice user dictates into a general-purpose computer or other recording device and the resulting file is transferred automatically to a human transcriptionist. The human transcriptionist transcribes the file, which is then returned to the original “author” for review. These systems have the perpetual overhead of employing a sufficient number of human transcriptionists to transcribe all of the dictation files.

Accordingly, it is another object of the present invention to provide an automated means of translating speech into text wherever suitable so as to minimize the number of human transcriptionists necessary to transcribe audio files coming into the system.

It is an associated object to provide a simplified means for providing verbatim text files for training a user's acoustic model for the speech recognition portion of the system.

It is another associated object of the present invention to automate a preexisting speech recognition program toward further minimizing the number of operators necessary to operate the system.

These and other objects will be apparent to those of ordinary skill in the art having the present drawings, specification and claims before them.

SUMMARY OF THE DISCLOSURE

The present disclosure relates to a system and method for substantially automating transcription services for one or more voice users. In particular, this system involves using two speech converting instances to facilitate the establishment of a verbatim transcription text with minimal human transcription.

The system includes means for receiving a voice dictation file from a current user. That voice dictation file is fed into first means for automatically converting the voice dictation file into a first written text and second means for automatically converting the voice dictation file into a second written text. The first and second means have first and second sets of conversion variables, respectively. These first and second sets of conversion variables have at least one difference.

For instance, where the first and second automatic speech converting means each comprise a preexisting speech recognition program, the programs themselves may be different from each other. Various speech recognition programs have inherently different speech-to-text conversion approaches, thus likely resulting in different conversions on difficult speech utterances, which, in turn, can be used to establish the verbatim text. Among the available preexisting speech converting means are Dragon Systems' Naturally Speaking, IBM's Via Voice and Philips Corporation's Speech Magic.

In another approach, the first and second sets of conversion variables could each comprise a language model (i.e. a general or a specialized language model), which again would likely result in different conversions on difficult utterances leading to easier establishment of the verbatim text. Alternatively, one or more settings associated with the preexisting speech recognition program(s) being used could be modified.

In yet another approach, the voice dictation file can be pre-processed prior to its input into one or both of the automatic conversion means. In this way, the conversion variables (e.g. digital word size, sampling rate, and removing particular harmonic ranges) can be differed between the speech conversion instances.

The system further includes means for manually editing a copy of said first and second written texts to create the verbatim text of the voice dictation file. In one approach, the first written text is at least temporarily synchronized to the voice dictation file. In this instance, the manual editing means includes means for sequentially comparing a copy of the first and second written texts resulting in a sequential list of unmatched words culled from the first written text. The manual editing means further includes means for incrementally searching for a current unmatched word contemporaneously within a first buffer associated with the first automatic conversion means containing the first written text and a second buffer associated with the sequential list. The manual editing means also includes means for correcting the current unmatched word in the second buffer. The correcting means includes means for displaying the current unmatched word in a manner substantially visually isolated from other text in the first written text and means for playing a portion of said synchronized voice dictation recording from the first buffer associated with the current unmatched word. In one embodiment, the editing means further includes means for alternatively viewing said current unmatched word in context within the copy of the first written text.

The system may also include training means to improve the accuracy of the speech recognition program.

The application also discloses a method for automating transcription services for one or more voice users in a system including at least one speech recognition program. The method includes: (1) receiving a voice dictation file from a current voice user; (2) automatically creating a first written text from the voice dictation file with a speech recognition program using a first set of conversion variables; (3) automatically creating a second written text from the voice dictation file with a speech recognition program using a second set of conversion variables; (4) manually establishing a verbatim file through comparison of the first and second written texts; and (5) returning the verbatim file to the current user. Establishing a verbatim file includes (6) sequentially comparing a copy of the first written text with the second written text resulting in a sequential list of unmatched words culled from the copy of the first written text, the sequential list having a beginning, an end and a current unmatched word, the current unmatched word being successively advanced from the beginning to the end; (7) incrementally searching for the current unmatched word contemporaneously within a first buffer associated with the at least one speech recognition program containing the first written text and a second buffer associated with the sequential list; (8) displaying the current unmatched word in a manner substantially visually isolated from other text in the copy of the first written text and playing a portion of the synchronized voice dictation recording from the first buffer associated with the current unmatched word; and (9) correcting the current unmatched word to be a verbatim representation of the portion of the synchronized voice dictation recording.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 of the drawings is a block diagram of one potential embodiment of the present system for substantially automating transcription services for one or more voice users;

FIG. 1 b of the drawings is a block diagram of a general-purpose computer which may be used as a dictation station, a transcription station and the control means within the present system;

FIG. 2 a of the drawings is a flow diagram of the main loop of the control means of the present system;

FIG. 2 b of the drawings is a flow diagram of the enrollment stage portion of the control means of the present system;

FIG. 2 c of the drawings is a flow diagram of the training stage portion of the control means of the present system;

FIG. 2 d of the drawings is a flow diagram of the automation stage portion of the control means of the present system;

FIG. 3 of the drawings is a directory structure used by the control means in the present system;

FIG. 4 of the drawings is a block diagram of a portion of a preferred embodiment of the manual editing means;

FIG. 5 of the drawings is an elevation view of the remainder of a preferred embodiment of the manual editing means; and

FIG. 6 of the drawings is an illustration of the arrangement of the system that presents the automated transcription system and method using two speech converting instances and computer-assisted correction.

BEST MODES OF PRACTICING THE INVENTION

While the present invention may be embodied in many different forms, there is shown in the drawings and discussed herein a few specific embodiments with the understanding that the present disclosure is to be considered only as an exemplification of the principles of the invention and is not intended to limit the invention to the embodiments illustrated.

FIG. 1 of the drawings generally shows one potential embodiment of the present system for substantially automating transcription services for one or more voice users. The present system must include some means for receiving a voice dictation file from a current user. This voice dictation file receiving means can be a digital audio recorder, an analog audio recorder, or standard means for receiving computer files on magnetic media or via a data connection.

As shown, in one embodiment, the system 100 includes multiple digital recording stations 10, 11, 12 and 13. Each digital recording station has at least a digital audio recorder and means for identifying the current voice user.

Preferably, each of these digital recording stations is implemented on a general-purpose computer (such as computer 20), although a specialized computer could be developed for this specific purpose. The general-purpose computer, though, has the added advantage of being adaptable to varying uses in addition to operating within the present system 100. In general, the general-purpose computer should have, among other elements, a microprocessor (such as the Intel Corporation PENTIUM, Cyrix K6 or Motorola 68000 series); volatile and non-volatile memory; one or more mass storage devices (i.e. HDD (not shown), floppy drive 21, and other removable media devices 22 such as a CD-ROM drive, DITTO, ZIP or JAZ drive (from Iomega Corporation) and the like); various user input devices, such as a mouse 23, a keyboard 24, or a microphone 25; and a video display system 26. In one embodiment, the general-purpose computer is controlled by the WINDOWS 9.x operating system. It is contemplated, however, that the present system would work equally well using a MACINTOSH computer or even another operating system such as WINDOWS CE, UNIX or a JAVA based operating system, to name a few.

Regardless of the particular computer platform used, in an embodiment utilizing an analog audio input (via microphone 25) the general-purpose computer must include a sound card (not shown). Of course, in an embodiment with a digital input no sound card would be necessary.

In the embodiment shown in FIG. 1, digital audio recording stations 10, 11, 12 and 13 are loaded and configured to run digital audio recording software on a PENTIUM-based computer system operating under WINDOWS 9.x. Such digital recording software is available as a utility in the WINDOWS 9.x operating system or from various third party vendors such as The Programmers' Consortium, Inc. of Oakton, Va. (VOICEDOC), Syntrillium Corporation of Phoenix, Ariz. (COOL EDIT) or Dragon Systems Corporation (Dragon Naturally Speaking Professional Edition). These various software programs produce a voice dictation file in the form of a “WAV” file. However, as would be known to those skilled in the art, other audio file formats, such as MP3 or DSS, could also be used to format the voice dictation file, without departing from the spirit of the present invention. In one embodiment where VOICEDOC software is used, that software also automatically assigns a file handle to the WAV file; however, it would be known to those of ordinary skill in the art to save an audio file on a computer system using standard operating system file management methods.

Another means for receiving a voice dictation file is dedicated digital recorder 14, such as the Olympus Digital Voice Recorder D-1000 manufactured by the Olympus Corporation. Thus, if the current voice user is more comfortable with a more conventional type of dictation device, they can continue to use a dedicated digital recorder 14. In order to harvest the digital audio file, upon completion of a recording, dedicated digital recorder 14 would be operably connected to one of the digital audio recording stations, such as 13, toward downloading the digital audio file into that general-purpose computer. With this approach, for instance, no audio card would be required.

Another alternative for receiving the voice dictation file may consist of using one form or another of removable magnetic media containing a pre-recorded audio file. With this alternative an operator would input the removable magnetic media into one of the digital audio recording stations toward uploading the audio file into the system.

In some cases it may be necessary to pre-process the audio files to make them acceptable for processing by the speech recognition software. For instance, a DSS file format may have to be changed to a WAV file format, or the sampling rate of a digital audio file may have to be upsampled or downsampled. For instance, in using the Olympus Digital Voice Recorder with Dragon Naturally Speaking, Olympus' 8 kHz rate needs to be upsampled to 11 kHz. Software to accomplish such pre-processing is available from a variety of sources including Syntrillium Corporation and Olympus Corporation.
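By way of illustration only, the sampling-rate conversion mentioned above might be sketched as follows. This is a minimal sketch assuming mono PCM samples held in a list and plain linear interpolation; it is not necessarily how the commercial tools named above perform the conversion (production resamplers generally use band-limited filtering). The function name `resample_linear` and the example rates of 8000 and 11025 samples per second are assumptions made for the example.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample mono PCM samples from src_rate to dst_rate (hypothetical helper).

    Uses linear interpolation between neighbouring input samples.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional position in the input
        left = min(int(pos), last)         # sample at or before that position
        right = min(left + 1, last)        # next sample, clamped at the end
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


# Example: one second of silence at an assumed 8000 Hz becomes 11025 samples.
upsampled = resample_linear([0] * 8000, 8000, 11025)
```

The same helper downsamples when `dst_rate` is lower than `src_rate`, although without a low-pass filter that direction is prone to aliasing.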

The other aspect of the digital audio recording stations is some means for identifying the current voice user. The identifying means may include keyboard 24 upon which the user (or a separate operator) can input the current user's unique identification code. Of course, the user identification can be input using a myriad of computer input devices such as pointing devices (e.g. mouse 23), a touch screen (not shown), a light pen (not shown), bar-code reader (not shown) or audio cues via microphone 25, to name a few.

In the case of a first time user the identifying means may also assign that user an identification number after receiving potentially identifying information from that user, including: (1) name; (2) address; (3) occupation; (4) vocal dialect or accent; etc. As discussed in association with the control means, based upon this input information, a voice user profile and a sub-directory within the control means are established. Thus, regardless of the particular identification means used, a user identification must be established for each voice user and subsequently provided with a corresponding digital audio file for each use such that the control means can appropriately route and the system ultimately transcribe the audio.

In one embodiment of the present invention, the identifying means may also seek the manual selection of a specialty vocabulary. It is contemplated that the specialty vocabulary sets may be general for various users such as medical (i.e. Radiology, Orthopedic Surgery, Gynecology) and legal (i.e. corporate, patent, litigation) or highly specific such that within each specialty the vocabulary parameters could be further limited based on the particular circumstances of a particular dictation file. For instance, if the current voice user is a Radiologist dictating the reading of an abdominal CAT scan, the nomenclature is highly specialized and different from the nomenclature for a renal ultrasound. By narrowly segmenting each selectable vocabulary set an increase in the accuracy of the automatic speech converter is likely.

As shown in FIG. 1, the digital audio recording stations may be operably connected to system 100 as part of computer network 30 or, alternatively, they may be operably connected to the system via internet host 15. As shown in FIG. 1 b, the general-purpose computer can be connected to both network jack 27 and telephone jack. With the use of an internet host, connection may be accomplished by e-mailing the audio file via the Internet. Another method for completing such connection is by way of direct modem connection via remote control software, such as PC ANYWHERE, which is available from Symantec Corporation of Cupertino, Calif. It is also possible, if the IP address of digital audio recording station 10 or internet host 15 is known, to transfer the audio file using basic file transfer protocol. Thus, as can be seen from the foregoing, the present system allows great flexibility for voice users to provide audio input into the system.

Control means 200 controls the flow of the voice dictation file based upon the training status of the current voice user. As shown in FIGS. 2 a, 2 b, 2 c, 2 d, control means 200 comprises a software program operating on general purpose computer 40. In particular, the program is initialized in step 201 where variables are set, buffers cleared and the particular configuration for this particular installation of the control means is loaded. Control means continually monitors a target directory (such as “current” (shown in FIG. 3)) to determine whether a new file has been moved into the target, step 202. Once a new file is found (such as “6723.id” (shown in FIG. 3)), a determination is made as to whether or not the current user 5 (shown in FIG. 1) is a new user, step 203.
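The monitoring of step 202 might be sketched, by way of example only, as a single polling pass over the target directory. The function name `find_new_files` and the `seen` set are hypothetical; the disclosure does not specify how the control means detects new files, and an installation could equally rely on operating system change notifications.

```python
import os


def find_new_files(target_dir, seen):
    """Return names in target_dir not reported before (hypothetical helper).

    `seen` is a set owned by the caller; it is updated in place so that
    each file name is reported only once across repeated polling passes.
    """
    new = sorted(name for name in os.listdir(target_dir) if name not in seen)
    seen.update(new)
    return new
```

A main loop could call this helper at a fixed interval and hand each newly found “.id” file to the new-user determination of step 203.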

For each new user (as indicated by the existence of a “.pro” file in the “current” subdirectory), a new subdirectory is established, step 204 (such as the “usern” subdirectory (shown in FIG. 3)). This subdirectory is used to store all of the audio files (“xxxx.wav”), written text (“xxxx.wrt”), verbatim text (“xxxx.vb”), transcription text (“xxxx.txt”) and user profile (“usern.pro”) for that particular user. Each particular job is assigned a unique number “xxxx” such that all of the files associated with a job can be associated by that number. With this directory structure, the number of users is practically limited only by storage space within general-purpose computer 40.
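The file naming convention just described can be illustrated with a short sketch. The suffixes come from the text above; the function names `job_files` and `is_new_user`, the dictionary keys, and the `root` directory argument are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Suffixes taken from the description; the dictionary keys are illustrative.
JOB_SUFFIXES = {
    "audio": ".wav",
    "written": ".wrt",
    "verbatim": ".vb",
    "transcribed": ".txt",
}


def job_files(root, user, job):
    """Map each kind of job file to its path in the user's subdirectory."""
    base = PurePosixPath(root) / user
    return {kind: str(base / f"{job}{suffix}")
            for kind, suffix in JOB_SUFFIXES.items()}


def is_new_user(current_dir_names):
    """A '.pro' file in the 'current' directory indicates a new user."""
    return any(name.endswith(".pro") for name in current_dir_names)
```

For instance, `job_files("users", "usern", "6723")["verbatim"]` yields `users/usern/6723.vb`, so every file belonging to job 6723 can be located from the job number alone.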

Now that the user subdirectory has been established, the user profile is moved to the subdirectory, step 205. The contents of this user profile may vary between systems. The contents of one potential user profile are shown in FIG. 3 as containing the user name, address, occupation and training status. Aside from the training status variable, which is necessary, the other data is useful in routing and transcribing the audio files.

The control means, having selected one set of files by the handle, determines the identity of the current user by comparing the “.id” file with its “user.tbl,” step 206. Now that the user is known, the user profile may be parsed from that user's subdirectory and the current training status determined, step 207. Steps 208-211 are the triage of the current training status, which is one of: enrollment, training, automate and stop automation.
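The triage on training status might be sketched as a simple dispatch, shown here by way of example only. The profile is assumed to be an already-parsed dictionary with a `training_status` key; that key, the function name `triage` and the handler table are hypothetical, since the disclosure does not fix a profile file format.

```python
# The four statuses named in the description.
VALID_STATUSES = ("enrollment", "training", "automate", "stop automation")


def triage(profile, handlers):
    """Call the handler registered for the profile's training status.

    `handlers` maps each status string to a callable taking the profile.
    """
    status = profile.get("training_status", "").strip().lower()
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown training status: {status!r}")
    return handlers[status](profile)
```

Keeping the stage logic in separate handlers mirrors the separate flow diagrams for the enrollment, training and automation stages.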

Enrollment is the first stage in automating transcription services. As shown in FIG. 2 b, the audio file is sent to transcription, step 301. In particular, the “xxxx.wav” file is transferred to transcriptionist stations 50 and 51. In a preferred embodiment, both stations are general-purpose computers, which run both an audio player and manual input means. The audio player is likely to be a digital audio player, although it is possible that an analog audio file could be transferred to the stations. Various audio players are commonly available, including a utility in the WINDOWS 9.x operating system and players from various third parties such as The Programmers' Consortium, Inc. of Oakton, Va. (VOICESCRIBE). Regardless of the audio player used to play the audio file, manual input means is running on the computer at the same time. This manual input means may comprise any text editor or word processor (such as MS WORD, WordPerfect, AmiPro or Word Pad) in combination with a keyboard, mouse, or other user-interface device. In one embodiment of the present invention, this manual input means may, itself, also be speech recognition software, such as Naturally Speaking from Dragon Systems of Newton, Mass., Via Voice from IBM Corporation of Armonk, N.Y., or Speech Magic from Philips Corporation of Atlanta, Ga. Human transcriptionist 6 listens to the audio file created by current user 5 and, as is known, manually inputs the perceived contents of that recorded text, thus establishing the transcribed file, step 302. Being human, human transcriptionist 6 is likely to impose experience, education and biases on the text and thus not input a verbatim transcript of the audio file. Upon completion of the human transcription, the human transcriptionist 6 saves the file and indicates that it is ready for transfer to the current user's subdirectory as “xxxx.txt”, step 303.

Inasmuch as this current user is only at the enrollment stage, a human operator will have to listen to the audio file and manually compare it to the transcribed file and create a verbatim file, step 304. That verbatim file “xxxx.vb” is also transferred to the current user's subdirectory, step 305. Now that verbatim text is available, control means 200 starts the automatic speech conversion means, step 306. This automatic speech conversion means may be a preexisting program, such as Dragon Systems' Naturally Speaking, IBM's Via Voice or Philips' Speech Magic, to name a few. Alternatively, it could be a unique program that is designed to specifically perform automated speech recognition.

In a preferred embodiment, Dragon Systems' Naturally Speaking has been used by running an executable simultaneously with Naturally Speaking that feeds phantom keystrokes and mousing operations through the WIN32API, such that Naturally Speaking believes that it is interacting with a human being, when in fact it is being controlled by control means 200. Such techniques are well known in the computer software testing art and, thus, will not be discussed in detail. It should suffice to say that by watching the application flow of any speech recognition program an executable to mimic the interactive manual steps can be created.

If the current user is a new user, the speech recognition program will need to establish the new user, step 307. Control means provides the necessary information from the user profile found in the current user's subdirectory. All speech recognition programs require significant training to establish an acoustic model of a particular user. In the case of Dragon, initially the program seeks approximately 20 minutes of audio usually obtained by the user reading a canned text provided by Dragon Systems. There is also functionality built into Dragon that allows “mobile training.” Using this feature, the verbatim file and audio file are fed into the speech recognition program to begin training the acoustic model for that user, step 308. Regardless of the length of that audio file, control means 200 closes the speech recognition program at the completion of the file, step 309.

As the enrollment step is too soon to use the automatically created text, a copy of the transcribed file is sent to the current user using the address information contained in the user profile, step 310. This address can be a street address or an e-mail address. Following that transmission, the program returns to the main loop on FIG. 2 a.

After a certain number of minutes of training have been conducted for a particular user, that user's training status may be changed from enrollment to training. The border for this change is subjective, but perhaps a good rule of thumb is once Dragon appears to be creating written text with 80% accuracy or more, the switch between states can be made. Thus, for such a user the next transcription event will prompt control means 200 into the training state. As shown in FIG. 2 c, steps 401-403 are the same human transcription steps as steps 301-303 in the enrollment phase. Once the transcribed file is established, control means 200 starts the automatic speech conversion means (or speech recognition program) and selects the current user, step 404. The audio file is fed into the speech recognition program and a written text is established within the program buffer, step 405. In the case of Dragon, this buffer is given the same file handle on every instance of the program. Thus, that buffer can be easily copied using standard operating system commands and manual editing can begin, step 406.
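One way to make the 80% rule of thumb concrete is sketched below. The disclosure leaves the accuracy measure subjective; this example assumes accuracy is the fraction of verbatim words that the written text reproduces in order, computed with Python's standard `difflib.SequenceMatcher`. The function names `word_accuracy` and `next_status` and the default threshold argument are assumptions.

```python
from difflib import SequenceMatcher


def word_accuracy(verbatim, written):
    """Fraction of verbatim words also found, in order, in the written text."""
    ref = verbatim.split()
    hyp = written.split()
    if not ref:
        return 1.0 if not hyp else 0.0
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


def next_status(status, accuracy, threshold=0.80):
    """Promote an enrollment-stage user once accuracy reaches the threshold."""
    if status == "enrollment" and accuracy >= threshold:
        return "training"
    return status
```

With the verbatim text "take out the trash today" and the written text "take out the cash today", four of five words match, giving an accuracy of 0.8 and a promotion from enrollment to training under the assumed threshold.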

In one particular embodiment utilizing the VOICEWARE system from The Programmers' Consortium, Inc. of Oakton, Va., the user inputs audio into the VOICEWARE system's VOICEDOC program, thus creating a “.wav” file. In addition, before releasing this “.wav” file to the VOICEWARE server, the user selects a “transcriptionist.” This “transcriptionist” may be a particular human transcriptionist or may be the “computerized transcriptionist.” If the user selects a “computerized transcriptionist” they may also select whether that transcription is handled locally or remotely. This file is assigned a job number by the VOICEWARE server, which routes the job to the VOICESCRIBE portion of the system. Normally, VOICESCRIBE is used by the human transcriptionist to receive and play back the job's audio (“.wav”) file. In addition, the audio file is grabbed by the automatic speech conversion means. In this VOICEWARE system embodiment, by placing VOICESCRIBE in “auto mode” new jobs (i.e. an audio file newly created by VOICEDOC) are automatically downloaded from the VOICEWARE server and a VOICESCRIBE window is opened having a window title formed by the job number of the current “.wav” file. An executable file, running in the background, “sees” the VOICESCRIBE window open and using the WIN32API determines the job number from the VOICESCRIBE window title. The executable file then launches the automatic speech conversion means. In Dragon Systems' Naturally Speaking, for instance, there is a built in function for performing speech recognition on a preexisting “.wav” file. The executable program feeds phantom keystrokes to Naturally Speaking to open the “.wav” file from the “current” directory (see FIG. 3) having the job number of the current job.

In this embodiment, after Naturally Speaking has completed automatically transcribing the contents of the “.wav” file, the executable file resumes operation by selecting all of the text in the open Naturally Speaking window and copying it to the WINDOWS 9.x operating system clipboard. Then, using the clipboard utility, the executable file saves the clipboard as a text file using the current job number with a “dmt” suffix. The executable file then “clicks” the “complete” button in VOICESCRIBE to return the “dmt” file to the VOICEWARE server. As would be understood by those of ordinary skill in the art, the foregoing procedure can be done utilizing other digital recording software and other automatic speech conversion means. Additionally, functionality analogous to the WINDOWS clipboard exists in other operating systems. It is also possible to require human intervention to activate or prompt one or more of the foregoing steps. Further, although the various programs executing various steps of this could be running on a number of interconnected computers (via a LAN, WAN, internet connectivity, email and the like), it is also contemplated that all of the necessary software can be running on a single computer.

Another alternative approach is also contemplated wherein the user dictates directly into the automatic speech conversion means and the VOICEWARE server picks up a copy in the reverse direction. This approach works as follows: without actually recording any voice, the user clicks on the “complete” button in VOICEDOC, thus creating an empty “.wav” file. This empty file is nevertheless assigned a unique job number by the VOICEWARE server. The user (or an executable file running in the background) then launches the automatic speech conversion means and the user dictates directly into that program, in the same manner previously used in association with such automatic speech conversion means. Upon completion of the dictation, the user presses a button labeled “return” (generated by a background executable file), which executable then commences a macro that gets the current job number from VOICEWARE (in the manner described above), selects all of the text in the document and copies it to the clipboard. The clipboard is then saved to the file “<jobnumber>.dmt,” as discussed above. The executable then “clicks” the “complete” button (via the WIN32API) in VOICESCRIBE, which effectively returns the automatically transcribed text file back to the VOICEWARE server, which, in turn, returns the completed transcription to the VOICESCRIBE user. Notably, although the various programs executing various steps of this could be running on a number of interconnected computers (via a LAN, WAN, internet connectivity, email and the like), it is also contemplated that all of the necessary software can be running on a single computer. As would be understood by those of ordinary skill in the art, the foregoing procedure can be done utilizing other digital recording software and other automatic speech conversion means. Additionally, functionality analogous to the WINDOWS clipboard exists in other operating systems. It is also possible to require human intervention to activate or prompt one or more of the foregoing steps.

Manual editing is not an easy task. Human beings are prone to errors. Thus, the present invention also includes means for improving on that task. As shown in FIG. 4, the transcribed file (“3333.txt”) and the copy of the written text (“3333.wrt”) are sequentially compared word by word 406 a toward establishing a sequential list of unmatched words 406 b that are culled from the copy of the written text. This list has a beginning and an end and pointer 406 c to the current unmatched word. Underlying the sequential list is another list of objects which contains the original unmatched words, as well as the words immediately before and after that unmatched word, the starting location in memory of each unmatched word in the sequential list of unmatched words 406 b and the length of the unmatched word.
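The word-by-word comparison and the underlying list of objects might be sketched as follows. This is a minimal illustration, not the disclosed implementation: it substitutes Python's standard `difflib.SequenceMatcher` for the sequential comparison, treats whitespace-delimited tokens as words (so attached punctuation counts as part of a word), and records a character offset in the written text as the "starting location." The names `Unmatched` and `unmatched_words` are hypothetical.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Unmatched:
    """One read-only record in the underlying list of objects."""
    word: str     # original unmatched word from the written text
    before: str   # word immediately before it ("" at the start)
    after: str    # word immediately after it ("" at the end)
    start: int    # character offset of the word in the written text
    length: int   # length of the unmatched word


def unmatched_words(transcribed, written):
    """List the written-text words that do not match the transcribed text."""
    tokens = [(m.group(), m.start()) for m in re.finditer(r"\S+", written)]
    words = [w for w, _ in tokens]
    matcher = SequenceMatcher(None, transcribed.split(), words, autojunk=False)
    result = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):      # words present only in written
            for j in range(j1, j2):
                word, start = tokens[j]
                before = words[j - 1] if j > 0 else ""
                after = words[j + 1] if j + 1 < len(words) else ""
                result.append(Unmatched(word, before, after, start, len(word)))
    return result
```

Comparing the transcribed text "take out the trash today" with the written text "take out the cash today" yields a single record for "cash", with "the" and "today" as its neighbours, a start offset of 13 and a length of 4.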

As shown in FIG. 5, the unmatched word pointed at by pointer 406 c from list 406 b is displayed in substantial visual isolation from the other text in the copy of the written text on a standard computer monitor 500 in an active window 501. As shown in FIG. 5, the context of the unmatched word can be selected by the operator to be shown within the sentence in which it resides, word by word, or in phrase context, by clicking on buttons 514, 515, and 516, respectively.

Associated with active window 501 is background window 502, which contains the copy of the written text file. As shown in background window 502, an incremental search has located (see pointer 503) the next occurrence of the current unmatched word “cash.” Contemporaneously therewith, within window 505 containing the buffer from the speech recognition program, the same incremental search has located (see pointer 506) the next occurrence of the current unmatched word. A human user, who will likely be viewing only active window 501, can activate the audio replay from the speech recognition program by clicking on “play” button 510, which plays the audio synchronized to the text at pointer 506. Based on that snippet of speech, which can be played over and over by clicking on the play button, the human user can manually input the correction to the current unmatched word via keyboard, mousing actions, or possibly even audible cues to another speech recognition program running within this window.

In the present example, even with the choice of isolated context offered by buttons 514, 515 and 516, it may still be difficult to determine the correct verbatim word out of context. Accordingly, there is a switch window button 513 that will move background window 502 to the foreground with visible pointer 503 indicating the current location within the copy of the written text. The user can then return to the active window and input the correct word, “trash.” This change will only affect the copy of the written text displayed in background window 502.

When the operator is ready for the next unmatched word, the operator clicks on the advance button 511, which advances pointer 406 c down the list of unmatched words and activates the incremental search in both windows 502 and 505. This unmatched word is now displayed in isolation and the operator can play the synchronized speech from the speech recognition program and correct this word as well. If at any point in the operation the operator would like to return to a previous unmatched word, the operator clicks on the reverse button 512, which moves pointer 406 c back a word in the list and causes a backward incremental search to occur. This is accomplished by using the underlying list of objects which contains the original unmatched words. This list is traversed in object-by-object fashion, but alternatively each of the records could be padded such that each item has the same word size to assist in bi-directional traversing of the list. As the unmatched words in this underlying list are read-only, it is possible to return to the original unmatched word such that the operator can determine if a different correction should have been made.
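The advance and reverse behavior described above can be pictured with a small cursor over a read-only list. The sketch below is only one possible arrangement and rests on assumptions: the class name `UnmatchedCursor` and its methods are hypothetical, the incremental searches in windows 502 and 505 and the audio playback are omitted, and corrections are kept apart from the original words so that the originals remain available when the operator steps back.

```python
from __future__ import annotations

from typing import Optional, Sequence


class UnmatchedCursor:
    """Bi-directional pointer over a read-only list of unmatched words."""

    def __init__(self, unmatched_words: Sequence[str]) -> None:
        if not unmatched_words:
            raise ValueError("no unmatched words to review")
        self._original = tuple(unmatched_words)   # read-only originals
        self._corrections: dict[int, str] = {}    # position -> operator's correction
        self._position = 0                        # plays the role of pointer 406 c

    @property
    def current(self) -> str:
        """The original unmatched word at the pointer."""
        return self._original[self._position]

    def advance(self) -> str:
        """Move toward the end of the list (advance button); stops at the end."""
        if self._position < len(self._original) - 1:
            self._position += 1
        return self.current

    def reverse(self) -> str:
        """Move back toward the beginning (reverse button); stops at the start."""
        if self._position > 0:
            self._position -= 1
        return self.current

    def correct(self, replacement: str) -> None:
        """Record a correction without altering the original word."""
        self._corrections[self._position] = replacement

    def correction(self) -> Optional[str]:
        """Return the correction made at the pointer, if any."""
        return self._corrections.get(self._position)
```

Because the originals are never overwritten, stepping back with `reverse()` shows both the original unmatched word and whatever correction was entered for it, which is what allows the operator to reconsider an earlier choice.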

Ultimately, the copy of the written text is finally corrected, resulting in a verbatim copy, which is saved to the user's subdirectory. The verbatim file is also passed to the speech recognition program for training, step 407. The new (and improved) acoustic model is saved, step 408, and the speech recognition program is closed, step 409. As the system is still in training, the transcribed file is returned to the user, as in step 310 from the enrollment phase.

As shown in FIG. 4, the system may also include means for determining the accuracy rate from the output of the sequential comparing means. Specifically, by counting the number of words in the written text and the number of words in list 406 b, the ratio of words in said sequential list to words in said written text can be determined, thus providing an accuracy percentage. As before, it is a matter of choice when to advance users from one stage to another. Once that goal is reached, the user's profile is changed to the next stage, step 211.
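The counting described above amounts to simple arithmetic, sketched below. The function names are hypothetical, and one assumption should be noted plainly: the text defines the ratio of unmatched words to written words, and this sketch reports the accuracy percentage as the complement of that ratio, which the description does not spell out.

```python
from __future__ import annotations

from typing import Sequence


def unmatched_ratio(written_text: str, unmatched_words: Sequence[str]) -> float:
    """Ratio of words in the unmatched list to words in the written text."""
    total = len(written_text.split())
    if total == 0:
        raise ValueError("written text contains no words")
    return len(unmatched_words) / total


def accuracy_percentage(written_text: str, unmatched_words: Sequence[str]) -> float:
    """Percentage of written words that matched (assumed complement of the ratio)."""
    total = len(written_text.split())
    if total == 0:
        raise ValueError("written text contains no words")
    return 100.0 * (total - len(unmatched_words)) / total
```

With a five-word written text and one unmatched word, the ratio is 0.2 and the accuracy percentage under this assumption is 80.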

One potential enhancement or derivative functionality is provided by the determination of the accuracy percentage. In one embodiment, this percentage could be used to evaluate a human transcriptionist's skills. In particular, by using either a known verbatim file or a well-established user, the associated “.wav” file would be played for the human transcriptionist and the foregoing comparison would be performed on the transcribed text versus the verbatim file created by the foregoing process. In this manner, additional functionality can be provided by the present system.

As is currently understood, manufacturers of speech recognition programs use recordings of foreign languages, dictions, etc. with manually established verbatim files to program speech models. It should be readily apparent that the foregoing manner of establishing verbatim text could be used in the initial development of these speech files, simplifying this process greatly.

Once the user has reached the automation stage, the greatest benefits of the present system can be achieved. The speech recognition software is started, step 600, and the current user selected, step 601. If desired, a particularized vocabulary may be selected, step 602. Then automatic conversion of the digital audio file recorded by the current user may commence, step 603. When completed, the written file is transmitted to the user based on the information contained in the user profile, step 604, and the program is returned to the main loop.

Unfortunately, there may be instances where the voice users cannot use automated transcription for a period of time (during an illness, after dental work, etc.) because their acoustic model has been temporarily (or even permanently) altered. In that case, the system administrator may set the training status variable to a stop automation state in which steps 301, 302, 303, 305 and 310 (see FIG. 2 b) are the only steps performed.

FIG. 6 of the drawings depicts another potential arrangement of various elements associated with the present invention. In this arrangement, as before, a user verbally dictates a document that they desire to have transcribed, which is saved as a voice dictation file 700 in one of the manners described above. In this embodiment, rather than have a human transcriptionist ever produce a transcribed file, the voice dictation file is automatically converted into written text at least twice.

After that double automatic text conversion, the resulting first and second written text files are compared one to another using manual copy editing means (as described above in association with FIGS. 4 and 5), facilitating a human operator in expeditiously and manually correcting the second written text file.

In this manner, it is believed that transcription service can be provided with far less human transcriptionist effort. The key to obtaining a sufficiently accurate written text for delivery to the end user is to differ the speech-to-text conversion in some way between the first and second runs. In particular, between the first and second conversion step the system may change one or more of the following:

(1) speech recognition programs (e.g. Dragon Systems' Naturally Speaking, IBM's Via Voice or Philips Corporation's Magic Speech);
(2) language models within a particular speech recognition program (e.g. general English versus a specialized vocabulary (e.g. medical, legal));
(3) settings within a particular speech recognition program (e.g. “most accurate” versus “speed”); and/or
(4) the voice dictation file, by pre-processing same with a digital signal processor (such as Cool Edit by Syntrillium Corporation of Phoenix, Ariz. or a programmed DSP56000 IC from Motorola, Inc.) by changing the digital word size or sampling rate, removing particular harmonic ranges, or making other potential modifications.

By changing one or more of the foregoing “conversion variables” it is believed that the second speech-to-text conversion will produce a slightly different written text than the first speech-to-text conversion and that by comparing the two resulting written texts using the novel manual editing means disclosed herein, a human operator can review the differences in the manner noted above and quickly produce a verbatim text for delivery to the end user. Thus, in this manner, it is believed that fully automated transcription can be achieved with less human intervention than in the other approaches disclosed.
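The four kinds of conversion variables listed above can be pictured as a small record, with a check that two sets differ in at least one respect. This is a sketch under stated assumptions only: the class `ConversionVariables`, its field names, its default values and the helper `differing_fields` are all hypothetical and do not correspond to any setting of an actual speech recognition program.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ConversionVariables:
    """One set of conversion variables (all field names are hypothetical)."""
    program: str                          # (1) which speech recognition program
    language_model: str = "general"       # (2) general versus specialized vocabulary
    setting: str = "most accurate"        # (3) program setting
    preprocessing: tuple[str, ...] = ()   # (4) audio pre-processing steps, if any


def differing_fields(first: ConversionVariables,
                     second: ConversionVariables) -> list[str]:
    """Names of the variables on which the two sets differ."""
    return [
        f.name
        for f in fields(first)
        if getattr(first, f.name) != getattr(second, f.name)
    ]


# Example: the second run reuses the first set but swaps the language model.
first_run = ConversionVariables(program="program_a")
second_run = replace(first_run, language_model="medical")
assert differing_fields(first_run, second_run) == ["language_model"]
```

An empty result from `differing_fields` would indicate that the two runs use identical variables, in which case the second conversion would not be expected to add anything to the comparison.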

This system and the underlying method are illustrated in FIG. 6. It should be noted that while two automatic speech conversion means 702 and 703 are depicted, there may be only a single instance of a speech recognition program running on a single computer, but using different conversion variables between iterations of conversion of the voice dictation file. Of course, it is equally possible to have multiple instances of a speech recognition program running on a single machine or even on separate machines interconnected by a computerized network (LAN, WAN, peer-to-peer, or the like) as would be known to those of ordinary skill in the art.

Similarly, while manual editing means 705 is depicted as being separate from the automated speech conversion means, it too may be implemented on the same computer as one or both of the instances of the automatic speech conversion means. Likewise, the manual editing means may also be implemented on a separate computer, interconnected with the other computers along a computerized network.

Finally, Digital Signal Processor 701 is shown to illustrate that one approach to changing the conversion variables is to alter the voice dictation file input to one or both of the instances of the automatic speech conversion means. Again, this digital signal processor can be implemented on the same computer as any one or all of the foregoing functional blocks or on a separate computer interconnected with the other computers via a computerized network.

It is contemplated that the foregoing case, in which two iterations of speech-to-text conversion are used, could be extrapolated to a case where even more conversion iterations are performed, each using various sets of conversion variables, with text comparison being performed between unique pairs of written text outputs and thereafter between each other, with a resulting increase in the accuracy of the automatic transcription, leaving fewer words to be considered in manual editing.
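The comparison between unique pairs of written text outputs can be sketched as below. This is a loose illustration rather than the contemplated method: the function `pairwise_unmatched` and the run labels are hypothetical, the alignment again relies on Python's `difflib.SequenceMatcher`, and the sketch stops at listing the unmatched words for each pair without attempting the further comparison "between each other".

```python
from __future__ import annotations

from difflib import SequenceMatcher
from itertools import combinations


def pairwise_unmatched(outputs: dict[str, str]) -> dict[tuple[str, str], list[str]]:
    """For every unique pair of runs, list words of the first run that the second lacks."""
    result: dict[tuple[str, str], list[str]] = {}
    for (name_a, text_a), (name_b, text_b) in combinations(outputs.items(), 2):
        words_a, words_b = text_a.split(), text_b.split()
        opcodes = SequenceMatcher(a=words_a, b=words_b, autojunk=False).get_opcodes()
        result[(name_a, name_b)] = [
            word
            for tag, i1, i2, _j1, _j2 in opcodes
            if tag in ("replace", "delete")   # words present only in the first run
            for word in words_a[i1:i2]
        ]
    return result
```

With three runs there are three unique pairs. A pair whose list is empty produced identical text, which suggests that words on which several runs agree may need less attention during manual editing.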

The foregoing description and drawings merely explain and illustrate the invention and the invention is not limited thereto. Those of skill in the art who have the disclosure before them will be able to make modifications and variations therein without departing from the scope of the present invention. For instance, it is possible to implement all of the elements of the present system on a single general-purpose computer by essentially time sharing the machine between the voice user, transcriptionist and the recognition program. The resulting cost saving makes this system accessible to more types of office situations, not simply large medical clinics, hospitals, law firms or other large entities.

1. A system for substantially automating transcription services for one or more voice users, comprising means for receiving a voice dictation file from a current user, said current user being one of said one or more voice users; first means for automatically converting said voice dictation file into a first written text, said first automatic conversion means having a first set of conversion variables; second means for automatically converting said voice dictation file into a second written text, said second automatic converting means having a second set of conversion variables, said first and second sets of conversion variables having at least one difference; and means for manually editing a copy of said first and second written texts to create a verbatim text of said voice dictation file; wherein said first written text is at least temporarily synchronized to said voice dictation file, and said manual editing means comprises: means for sequentially comparing a copy of said first written text with said second written text resulting in a sequential list of unmatched words culled from said copy of said first written text, said sequential list having a beginning, an end and a current unmatched word, said current unmatched word being successively advanced from said beginning to said end; means for incrementally searching for said current unmatched word contemporaneously within a first buffer associated with said first automatic conversion means containing said first written text and a second buffer associated with said sequential list; and means for correcting said current unmatched word in said second buffer, said correcting means including means for displaying said current unmatched word in a manner substantially visually isolated from other text in said copy of said first written text and means for playing a portion of said synchronized voice dictation recording from said first buffer associated with said current unmatched word.
 2. The invention according to claim 1 wherein said difference between said first and second sets of conversion variables comprises at least one setting associated with said preexisting speech recognition program.
 3. The invention according to claim 2 wherein said editing means further includes means for alternatively viewing said current unmatched word in context within said copy of said first written text.
 4. The invention according to claim 2 wherein said first and second automatic speech converting means each comprises a preexisting speech recognition program intended for human interactive use, each of said first and second automatic speech converting means includes means for automating responses to a series of interactive inquiries from said preexisting speech recognition program.
 5. The invention according to claim 4 wherein said difference between said first and second sets of conversion variables is said preexisting speech recognition program comprising said first and second automatic speech converting means.
 6. The invention according to claim 5 wherein said automatic speech converting means is selected from the group consisting essentially of Dragon Systems' Naturally Speaking, IBM's Via Voice and Philips Corporation's Magic Speech.
 7. The invention according to claim 2 wherein said difference between said first and second sets of conversion variables comprises a language model used in association with said preexisting speech recognition program.
 8. The invention according to claim 7 wherein a generalized language model is used in said first set of conversion variables and a specialized language model is used in said second set of conversion variables.
 9. The invention according to claim 1 wherein said difference between said first and second sets of conversion variables comprises means for pre-processing audio prior to its input to said first automatic conversion means.
 10. The invention according to claim 8 wherein said difference between said first and second sets of conversion variables comprises means for pre-processing audio prior to its input to said second automatic conversion means, wherein said first and second pre-processing variable is different.
 11. The invention according to claim 10 wherein said pre-processing variables is selected from the group consisting essentially of digital word size, sampling rate, and removing particular harmonic ranges.
 12. The invention according to claim 1 wherein said difference between said first and second sets of conversion variables comprises a language model used in association with said preexisting speech recognition program.
 13. The invention according to claim 12 wherein a generalized language model is used in said first set of conversion variables and a specialized language model is used in said second set of conversion variables.
 14. The invention according to claim 1 wherein said difference between said first and second sets of conversion variables comprises means for pre-processing audio prior to its input to said first automatic conversion means.
 15. The invention according to claim 14 wherein said difference between said first and second sets of conversion variables comprises means for pre-processing audio prior to its input to said second automatic conversion means, wherein said first and second pre-processing variable is different.
 16. The invention according to claim 1 further including means for training said automatic speech converting means to achieve higher accuracy with said voice dictation file of current user.
 17. The invention according to claim 16 wherein said training means comprises a preexisting training portion of a preexisting speech recognition program intended for human interactive use, said training means includes means for automating responses to a series of interactive inquiries from said preexisting training portion of said preexisting speech recognition program.
 18. A method for automating transcription services for one or more voice users in a system including at least one speech recognition program, comprising receiving a voice dictation file from a current voice user; automatically creating a first written text from the voice dictation file with a speech recognition program using a first set of conversion variables; automatically creating a second written text from the voice dictation file with a speech recognition program using a second set of conversion variables; manually establishing a verbatim file through comparison of the first and second written texts; and returning the verbatim file to the current user, wherein said step of manually establishing a verbatim file includes the sub-steps of: sequentially comparing a copy of the first written text with the second written text resulting in a sequential list of unmatched words culled from the copy of the first written text, the sequential list having a beginning, an end and a current unmatched word, the current unmatched word being successively advanced from the beginning to the end; incrementally searching for the current unmatched word contemporaneously within a first buffer associated with the at least one speech recognition program containing the first written text and a second buffer associated with the sequential list; and displaying the current unmatched word in a manner substantially visually isolated from other text in the copy of the first written text and playing a portion of the synchronized voice dictation recording from the first buffer associated with the current unmatched word; and correcting the current unmatched word to be a verbatim representation of the portion of the synchronized voice dictation recording.
 19. The invention according to claim 18 further comprising: selecting the first set of conversion variables from available preexisting speech recognition programs; and differently selecting the second set of conversion variables from available preexisting speech recognition programs.
 20. The invention according to claim 18 further comprising: selecting the first set of conversion variables from available language models; and differently selecting the second set of conversion variables from available language models.
 21. The invention according to claim 18 further comprising preprocessing the voice dictation file before automatically creating a first written text, the preprocessing forming at least a part of the first set of conversion variables.
 22. The invention according to claim 21 further comprising preprocessing the voice dictation file differently than the first set of preprocessing conversion variables before automatically creating a second written text, the preprocessing forming at least a part of the second set of conversion variables. 